68 research outputs found
One Homonym per Translation
The study of homonymy is vital to resolving fundamental problems in lexical
semantics. In this paper, we propose four hypotheses that characterize the
unique behavior of homonyms in the context of translations, discourses,
collocations, and sense clusters. We present a new annotated homonym resource
that allows us to test our hypotheses on existing WSD resources. The results of
the experiments provide strong empirical evidence for the hypotheses. This
study represents a step towards a computational method for distinguishing
between homonymy and polysemy, and constructing a definitive inventory of
coarse-grained senses.
Comment: 8 pages, including reference
A Fast Method for Parallel Document Identification
We present a fast method to identify homogeneous parallel documents. The
method is based on collecting counts of identical low-frequency words between
possibly parallel documents. The candidate with the most shared low-frequency
words is selected as the parallel document. The method achieved 99.96% accuracy
when tested on the EUROPARL corpus of parliamentary proceedings, failing only
in anomalous cases of truncated or otherwise distorted documents. While other
work has shown similar performance on this type of dataset, our approach
presented here is faster and does not require training. Apart from proposing an
efficient method for parallel document identification in a restricted domain,
this paper furnishes evidence that parliamentary proceedings may be
inappropriate for testing parallel document identification systems in general.
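
The selection rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenization, the choice to measure frequency within each document, and the `max_count` threshold are all assumptions made for illustration.

```python
from collections import Counter


def low_frequency_words(tokens, max_count=1):
    """Return the set of words occurring at most max_count times in a document.

    Assumption: "low-frequency" is measured inside each document; the
    abstract does not say how the threshold is chosen.
    """
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count <= max_count}


def find_parallel(source_tokens, candidates, max_count=1):
    """Pick the candidate sharing the most identical low-frequency words.

    candidates maps a (hypothetical) document name to its token list.
    Returns a (name, shared_word_count) pair.
    """
    source_rare = low_frequency_words(source_tokens, max_count)
    best_name, best_overlap = None, -1
    for name, tokens in candidates.items():
        overlap = len(source_rare & low_frequency_words(tokens, max_count))
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name, best_overlap


# Toy data, invented for this sketch: numbers and names tend to survive
# translation unchanged, so they act as anchors between the two languages.
source = "the session of 12 March 2003 was opened by Mr Prodi in Strasbourg".split()
candidates = {
    "de_1": "die Sitzung vom 12 Maerz 2003 wurde von Herrn Prodi in Strassburg eroeffnet".split(),
    "de_2": "die Sitzung vom 4 Mai 2001 wurde von Frau Fontaine in Bruessel eroeffnet".split(),
}
result = find_parallel(source, candidates)
print(result)
```

No training step is involved in this sketch: only set intersections over word counts are computed, which is in keeping with the abstract's claim that the approach does not require training.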
One Sense Per Translation
The idea of using lexical translations to define sense inventories has a long
history in lexical semantics. We propose a theoretical framework which allows
us to answer the question of why this apparently reasonable idea failed to
produce useful results. We formally prove several propositions on how the
translations of a word relate to its senses, as well as on the relationship
between synonymy and polysemy. We empirically validate our theoretical findings
on BabelNet, and demonstrate how they could be used to perform unsupervised
word sense disambiguation of a substantial fraction of the lexicon.
The Application of Chordal Graphs to Inferring Phylogenetic Trees of Languages
Phylogenetic methods are used to build evolutionary trees of languages given
character data that may include lexical, phonological, and morphological
information. Such data rarely admits a perfect phylogeny. We explore the use of
the more permissive conservative Dollo phylogeny as an alternative or
complementary approach. We propose a heuristic search algorithm based on the
notion of chordal graphs. We test this approach by generating phylogenetic trees
from three datasets, and comparing them to those produced by other researchers.
Visually-Grounded Descriptions Improve Zero-Shot Image Classification
Language-vision models like CLIP have made significant progress in zero-shot
vision tasks, such as zero-shot image classification (ZSIC). However,
generating specific and expressive class descriptions remains a major
challenge. Existing approaches suffer from granularity and label ambiguity
issues. To tackle these challenges, we propose V-GLOSS: Visual Glosses, a novel
method leveraging modern language models and semantic knowledge bases to
produce visually-grounded class descriptions. We demonstrate V-GLOSS's
effectiveness by achieving state-of-the-art results on benchmark ZSIC datasets
including ImageNet and STL-10. In addition, we introduce a silver dataset with
class descriptions generated by V-GLOSS, and show its usefulness for vision
tasks. We make available our code and dataset.
Don't Trust ChatGPT when Your Question is not in English: A Study of Multilingual Abilities and Types of LLMs
Large Language Models (LLMs) have demonstrated exceptional natural language
understanding abilities and have excelled in a variety of natural language
processing (NLP) tasks in recent years. Despite the fact that most LLMs are
trained predominantly in English, multiple studies have demonstrated their
comparative performance in many other languages. However, fundamental questions
persist regarding how LLMs acquire their multi-lingual abilities and how
performance varies across different languages. These inquiries are crucial for
the study of LLMs since users and researchers often come from diverse language
backgrounds, potentially influencing their utilization and interpretation of
LLMs' results. In this work, we propose a systematic way of qualifying the
performance disparities of LLMs under multilingual settings. We investigate the
phenomenon of across-language generalizations in LLMs, wherein insufficient
multi-lingual training data leads to advanced multi-lingual capabilities. To
accomplish this, we employ a novel back-translation-based prompting method. The
results show that GPT exhibits highly translating-like behaviour in
multilingual settings.
Comment: Paper accepted to EMNLP 202